Machine Learning Based Source Code Classification Using Syntax Oriented Features

Authors

  • Shaul Zevin
  • Catherine Holzem
Abstract

Today, the programming language of the vast majority of published source code is either specified manually or assigned programmatically based solely on the file extension. In this paper we show that the task of identifying the programming language of source code can be fully automated using machine learning techniques. We first define the criteria that a production-level automatic programming language identification solution should meet: accuracy, programming language coverage, extensibility and performance. We then describe our approach: how training files are preprocessed to extract features that mimic grammar productions, and how these extracted ‘grammar productions’ are used to train and test our classifier. We achieve a 99% accuracy rate while classifying 29 of the most popular programming languages with a Maximum Entropy classifier.
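
The abstract does not say how the features are built or which implementation the authors use. As a rough illustration only, the sketch below treats n-grams over abstracted tokens as a stand-in for ‘grammar productions’ and fits a small maximum entropy model (multinomial logistic regression) by stochastic gradient ascent. The tokenizer, the keyword list, the toy training set, and the names `extract_features`, `train_maxent`, `predict_proba` and `classify` are all assumptions made for this example, not the paper's method.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: identifiers, integers, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

# Hypothetical keyword list; every other identifier is abstracted to "<id>".
KEYWORDS = {"def", "return", "import", "print", "include", "int", "void"}


def extract_features(source, n=2):
    """Count n-grams over abstracted tokens (an assumed proxy for productions)."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("<id>")
        elif tok[0].isdigit():
            tokens.append("<num>")
        else:
            tokens.append(tok)
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


def predict_proba(weights, feats):
    """Softmax over per-language linear scores."""
    scores = {
        label: sum(w[f] * v for f, v in feats.items())
        for label, w in weights.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def train_maxent(samples, epochs=200, lr=0.1):
    """Fit the model by stochastic gradient ascent on the log-likelihood."""
    labels = sorted({label for _, label in samples})
    weights = {label: Counter() for label in labels}
    for _ in range(epochs):
        for feats, gold in samples:
            probs = predict_proba(weights, feats)
            for label in labels:
                # Gradient term: observed indicator minus predicted probability.
                err = (1.0 if label == gold else 0.0) - probs[label]
                for f, v in feats.items():
                    weights[label][f] += lr * err * v
    return weights


def classify(weights, source):
    """Return the most probable language label for a source string."""
    probs = predict_proba(weights, extract_features(source))
    return max(probs, key=probs.get)


# Toy training set, invented for illustration; real training needs many files.
TRAIN = [
    ("def add(a, b):\n    return a + b\n", "Python"),
    ("import os\nprint(os.name)\n", "Python"),
    ("#include <stdio.h>\nint main(void) { return 0; }\n", "C"),
    ("int sq(int x) { return x * x; }\n", "C"),
]
weights = train_maxent([(extract_features(src), lang) for src, lang in TRAIN])
```

Calling `classify(weights, some_source_text)` returns one of the labels seen during training. Abstracting identifiers and numbers keeps the features tied to syntax rather than to vocabulary, which is the general idea the abstract points at.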

Similar resources

Source Code Authorship Attribution Using Long Short-Term Memory Based Networks

Machine learning approaches to source code authorship attribution attempt to find statistical regularities in human-generated source code that can identify the author or authors of that code. This has applications in plagiarism detection, intellectual property infringement, and post-incident forensics in computer security. The introduction of features derived from the Abstract Syntax Tree (AST)...

Using Classification Techniques to Determine Source Code Authorship

The ability to test authorship of source code is a useful technique. We present techniques for using statistical machine learning to accomplish this task. We translate source code into abstract syntax trees and then split up the trees into functions. The tree for each function is considered a document, with a given author. This collection is fed to an SVM package using a kernel that operates on...

A Comparative Study of Different Source Code Metrics and Machine Learning Algorithms for Predicting Change Proneness of Object Oriented Systems

Change-prone classes or modules are software components in the source code that are likely to change in the future. Change-proneness prediction is useful to the maintenance team, as they can focus their testing resources on the modules that have a higher likelihood of change. A change-proneness prediction model can be built by using source code metrics as predictors or fe...

Body Mass Index Classification based on Facial Features using Machine Learning Algorithms for utilizing in Telemedicine

Background and Objectives: Because controlling BMI has an impact on life, BMI classification based on facial features can be used to develop telemedicine systems and to remove the limitations of measuring tools, especially for paralyzed people, so that physicians can help people online during the Covid-19 pandemic. Method: In this study, new features and some previous work features were e...

Image Classification via Sparse Representation and Subspace Alignment

Image representation is a crucial problem in image processing, where many low-level image representations exist, e.g., SIFT, HOG and so on. But there is a missing link between low-level and high-level semantic representations. In fact, traditional machine learning approaches, e.g., non-negative matrix factorization, sparse representation and principal component analysis are employed to d...

Journal title:
  • CoRR

Volume: abs/1703.07638

Publication date: 2017